{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression Example\n",
    "\n",
    "Let's walk through the steps of the official documentation example. Doing this will help your ability to read from the documentation, understand it, and then apply it to your own problems (the upcoming Consulting Project)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "spark = SparkSession.builder.appName('lr_example').getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.regression import LinearRegression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load training data\n",
    "training = spark.read.format(\"libsvm\").load(\"sample_linear_regression_data.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interesting! We haven't seen libsvm formats before. In fact the aren't very popular when working with datasets in Python, but the Spark Documentation makes use of them a lot because of their formatting. Let's see what the training data looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+--------------------+\n",
      "|              label|            features|\n",
      "+-------------------+--------------------+\n",
      "| -9.490009878824548|(10,[0,1,2,3,4,5,...|\n",
      "| 0.2577820163584905|(10,[0,1,2,3,4,5,...|\n",
      "| -4.438869807456516|(10,[0,1,2,3,4,5,...|\n",
      "|-19.782762789614537|(10,[0,1,2,3,4,5,...|\n",
      "| -7.966593841555266|(10,[0,1,2,3,4,5,...|\n",
      "| -7.896274316726144|(10,[0,1,2,3,4,5,...|\n",
      "| -8.464803554195287|(10,[0,1,2,3,4,5,...|\n",
      "| 2.1214592666251364|(10,[0,1,2,3,4,5,...|\n",
      "| 1.0720117616524107|(10,[0,1,2,3,4,5,...|\n",
      "|-13.772441561702871|(10,[0,1,2,3,4,5,...|\n",
      "| -5.082010756207233|(10,[0,1,2,3,4,5,...|\n",
      "|  7.887786536531237|(10,[0,1,2,3,4,5,...|\n",
      "| 14.323146365332388|(10,[0,1,2,3,4,5,...|\n",
      "|-20.057482615789212|(10,[0,1,2,3,4,5,...|\n",
      "|-0.8995693247765151|(10,[0,1,2,3,4,5,...|\n",
      "| -19.16829262296376|(10,[0,1,2,3,4,5,...|\n",
      "|  5.601801561245534|(10,[0,1,2,3,4,5,...|\n",
      "|-3.2256352187273354|(10,[0,1,2,3,4,5,...|\n",
      "| 1.5299675726687754|(10,[0,1,2,3,4,5,...|\n",
      "| -0.250102447941961|(10,[0,1,2,3,4,5,...|\n",
      "+-------------------+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "training.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the format that Spark expects. Two columns with the names \"label\" and \"features\". \n",
    "\n",
    "The \"label\" column then needs to have the numerical label, either a regression numerical value, or a numerical value that matches to a classification grouping. Later on we will talk about unsupervised learning algorithms that by their nature do not use or require a label.\n",
    "\n",
    "The feature column has inside of it a vector of all the features that belong to that row. Usually what we end up doing is combining the various feature columns we have into a single 'features' column using the data transformations we've learned about.\n",
    "\n",
    "Let's continue working through this simple example!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# These are the default values for the featuresCol, labelCol, predictionCol\n",
    "lr = LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction')\n",
    "\n",
    "# You could also pass in additional parameters for regularization, do the reading \n",
    "# in ISLR to fully understand that, after that its just some simple parameter calls.\n",
    "# Check the documentation with Shift+Tab for more info!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Fit the model\n",
    "lrModel = lr.fit(training)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Coefficients: [0.00733507102258,0.831375758434,-0.809530795468,2.44119168688,0.519171379529,1.15345919035,-0.298912411281,-0.51285141862,-0.619712827067,0.695615180432]\n",
      "\n",
      "\n",
      "Intercept:0.14228558260358087\n"
     ]
    }
   ],
   "source": [
    "# Print the coefficients and intercept for linear regression\n",
    "print(\"Coefficients: {}\".format(str(lrModel.coefficients))) # For each feature...\n",
    "print('\\n')\n",
    "print(\"Intercept:{}\".format(str(lrModel.intercept)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a summary attribute that contains even more info!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Summarize the model over the training set and print out some metrics\n",
    "trainingSummary = lrModel.summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lots of info, here are a few examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+\n",
      "|          residuals|\n",
      "+-------------------+\n",
      "|-11.011130022096554|\n",
      "| 0.9236590911176537|\n",
      "|-4.5957401897776675|\n",
      "|  -20.4201774575836|\n",
      "|-10.339160314788181|\n",
      "|-5.9552091439610555|\n",
      "|-10.726906349283922|\n",
      "|  2.122807193191233|\n",
      "|  4.077122222293812|\n",
      "|-17.316168071241652|\n",
      "|-4.5930443439590585|\n",
      "|  6.380476690746936|\n",
      "| 11.320566035059846|\n",
      "|-20.721971774534094|\n",
      "| -2.736692773777402|\n",
      "|-16.668869342528467|\n",
      "|  8.242186378876315|\n",
      "|-1.3723486332690227|\n",
      "|-0.7060332131264666|\n",
      "| -1.159113596999406|\n",
      "+-------------------+\n",
      "only showing top 20 rows\n",
      "\n",
      "RMSE: 10.16309157133015\n",
      "r2: 0.027839179518600154\n"
     ]
    }
   ],
   "source": [
    "trainingSummary.residuals.show()\n",
    "print(\"RMSE: {}\".format(trainingSummary.rootMeanSquaredError))\n",
    "print(\"r2: {}\".format(trainingSummary.r2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train/Test Splits\n",
    "\n",
    "But wait! We've commited a big mistake, we never separated our data set into a training and test set. Instead we trained on ALL of the data, something we generally want to avoid doing. Read ISLR and check out the theory lecture for more info on this, but remember we won't get a fair evaluation of our model by judging how well it does again on the same data it was trained on!\n",
    "\n",
    "Luckily Spark DataFrames have an almost too convienent method of splitting the data! Let's see it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all_data = spark.read.format(\"libsvm\").load(\"sample_linear_regression_data.txt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Pass in the split between training/test as a list.\n",
    "# No correct, but generally 70/30 or 60/40 splits are used. \n",
    "# Depending on how much data you have and how unbalanced it is.\n",
    "train_data,test_data = all_data.randomSplit([0.7,0.3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+--------------------+\n",
      "|              label|            features|\n",
      "+-------------------+--------------------+\n",
      "|-28.571478869743427|(10,[0,1,2,3,4,5,...|\n",
      "|-28.046018037776633|(10,[0,1,2,3,4,5,...|\n",
      "|-26.736207182601724|(10,[0,1,2,3,4,5,...|\n",
      "| -23.51088409032297|(10,[0,1,2,3,4,5,...|\n",
      "|-23.487440120936512|(10,[0,1,2,3,4,5,...|\n",
      "|-22.949825936196074|(10,[0,1,2,3,4,5,...|\n",
      "|-20.212077258958672|(10,[0,1,2,3,4,5,...|\n",
      "|-20.057482615789212|(10,[0,1,2,3,4,5,...|\n",
      "|-19.884560774273424|(10,[0,1,2,3,4,5,...|\n",
      "|-19.872991038068406|(10,[0,1,2,3,4,5,...|\n",
      "|-19.782762789614537|(10,[0,1,2,3,4,5,...|\n",
      "|-19.402336030214553|(10,[0,1,2,3,4,5,...|\n",
      "| -19.16829262296376|(10,[0,1,2,3,4,5,...|\n",
      "|-18.845922472898582|(10,[0,1,2,3,4,5,...|\n",
      "| -18.27521356600463|(10,[0,1,2,3,4,5,...|\n",
      "|-17.803626188664516|(10,[0,1,2,3,4,5,...|\n",
      "|-17.428674570939506|(10,[0,1,2,3,4,5,...|\n",
      "| -17.32672073267595|(10,[0,1,2,3,4,5,...|\n",
      "|-17.026492264209548|(10,[0,1,2,3,4,5,...|\n",
      "|-16.692207021311106|(10,[0,1,2,3,4,5,...|\n",
      "+-------------------+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "train_data.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+--------------------+\n",
      "|              label|            features|\n",
      "+-------------------+--------------------+\n",
      "|-26.805483428483072|(10,[0,1,2,3,4,5,...|\n",
      "|-22.837460416919342|(10,[0,1,2,3,4,5,...|\n",
      "|-21.432387764165806|(10,[0,1,2,3,4,5,...|\n",
      "| -19.66731861537172|(10,[0,1,2,3,4,5,...|\n",
      "|-17.494200356883344|(10,[0,1,2,3,4,5,...|\n",
      "|-17.065399625876015|(10,[0,1,2,3,4,5,...|\n",
      "| -16.71909683360509|(10,[0,1,2,3,4,5,...|\n",
      "| -16.08565904102149|(10,[0,1,2,3,4,5,...|\n",
      "|-15.732088272239245|(10,[0,1,2,3,4,5,...|\n",
      "|-13.420594775890757|(10,[0,1,2,3,4,5,...|\n",
      "|-12.977848725392104|(10,[0,1,2,3,4,5,...|\n",
      "| -12.92222310337042|(10,[0,1,2,3,4,5,...|\n",
      "|-12.558575788856189|(10,[0,1,2,3,4,5,...|\n",
      "|-12.479280211451497|(10,[0,1,2,3,4,5,...|\n",
      "| -12.46765638103286|(10,[0,1,2,3,4,5,...|\n",
      "|-12.198096564661412|(10,[0,1,2,3,4,5,...|\n",
      "|-11.904986902675114|(10,[0,1,2,3,4,5,...|\n",
      "|-11.857350365429426|(10,[0,1,2,3,4,5,...|\n",
      "|-11.615775265015627|(10,[0,1,2,3,4,5,...|\n",
      "|-11.328415936777782|(10,[0,1,2,3,4,5,...|\n",
      "+-------------------+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "test_data.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "unlabeled_data = test_data.select('features')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+\n",
      "|            features|\n",
      "+--------------------+\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "|(10,[0,1,2,3,4,5,...|\n",
      "+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "unlabeled_data.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we only train on the train_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "correct_model = lr.fit(train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can directly get a .summary object using the evaluate method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "test_results = correct_model.evaluate(test_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+\n",
      "|          residuals|\n",
      "+-------------------+\n",
      "|-27.947989982759562|\n",
      "| -21.48974651218455|\n",
      "|-21.348863706083357|\n",
      "| -19.11295339381348|\n",
      "| -16.37056545252385|\n",
      "|-18.142472725017456|\n",
      "| -16.58506142820683|\n",
      "|-13.240519904071881|\n",
      "| -17.84000416941091|\n",
      "|-12.585063403174502|\n",
      "| -15.18750826797708|\n",
      "| -15.36121416276331|\n",
      "|-13.574499908567173|\n",
      "|-10.963252677902677|\n",
      "|-11.315618505456634|\n",
      "|-14.605252333044671|\n",
      "|-11.894926359717878|\n",
      "|-12.314571866864313|\n",
      "|-12.299976556997414|\n",
      "| -11.33468035340431|\n",
      "+-------------------+\n",
      "only showing top 20 rows\n",
      "\n",
      "RMSE: 10.1830320696341\n"
     ]
    }
   ],
   "source": [
    "test_results.residuals.show()\n",
    "print(\"RMSE: {}\".format(test_results.rootMeanSquaredError))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well that is nice, but realistically we will eventually want to test this model against unlabeled data, after all, that is the whole point of building the model in the first place. We can again do this with a convenient method call, in this case, transform(). Which was actually being called within the evaluate() method. Let's see it in action:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "predictions = correct_model.transform(unlabeled_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------------------+\n",
      "|            features|          prediction|\n",
      "+--------------------+--------------------+\n",
      "|(10,[0,1,2,3,4,5,...|  1.1425065542764918|\n",
      "|(10,[0,1,2,3,4,5,...| -1.3477139047347926|\n",
      "|(10,[0,1,2,3,4,5,...|-0.08352405808245028|\n",
      "|(10,[0,1,2,3,4,5,...|  -0.554365221558238|\n",
      "|(10,[0,1,2,3,4,5,...| -1.1236349043594933|\n",
      "|(10,[0,1,2,3,4,5,...|    1.07707309914144|\n",
      "|(10,[0,1,2,3,4,5,...|-0.13403540539825934|\n",
      "|(10,[0,1,2,3,4,5,...|  -2.845139136949609|\n",
      "|(10,[0,1,2,3,4,5,...|   2.107915897171665|\n",
      "|(10,[0,1,2,3,4,5,...| -0.8355313727162558|\n",
      "|(10,[0,1,2,3,4,5,...|   2.209659542584976|\n",
      "|(10,[0,1,2,3,4,5,...|  2.4389910593928907|\n",
      "|(10,[0,1,2,3,4,5,...|  1.0159241197109834|\n",
      "|(10,[0,1,2,3,4,5,...| -1.5160275335488211|\n",
      "|(10,[0,1,2,3,4,5,...| -1.1520378755762275|\n",
      "|(10,[0,1,2,3,4,5,...|  2.4071557683832587|\n",
      "|(10,[0,1,2,3,4,5,...|-0.01006054295723...|\n",
      "|(10,[0,1,2,3,4,5,...| 0.45722150143488616|\n",
      "|(10,[0,1,2,3,4,5,...|  0.6842012919817876|\n",
      "|(10,[0,1,2,3,4,5,...|0.006264416626529094|\n",
      "+--------------------+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "predictions.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, so this data is a bit meaningless, so let's explore this same process with some data that actually makes a little more intuitive sense!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
